New adaptive compressors for natural language text

Authors

  • Nieves R. Brisaboa
  • Antonio Fariña
  • Gonzalo Navarro
  • José R. Paramá
Abstract

Semistatic byte-oriented word-based compression codes have been shown to be an attractive alternative to compress natural language text databases, because of the combination of speed, effectiveness, and direct searchability they offer. In particular, our recently proposed family of dense compression codes has been shown to be superior to the more traditional byte-oriented word-based Huffman codes in most aspects. In this paper, we focus on the problem of transmitting texts among peers that do not share the vocabulary. This is the typical scenario for adaptive compression methods. We design adaptive variants of our semistatic dense codes, showing that they are much simpler and faster than dynamic Huffman codes and reach almost the same compression effectiveness. We show that our variants have a very compelling trade-off between compression/decompression speed, compression ratio and search speed compared with most of the state-of-the-art general compressors.
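The abstract gives no algorithmic detail, so the following is only a minimal sketch of how an adaptive word-based dense code could work between peers that do not share a vocabulary. It assumes an end-tagged byte code (the last byte of each codeword has its high bit set), an escape codeword equal to the current vocabulary size for unseen words, and a NUL-terminated spelling of each new word. All names (`etdc_encode`, `AdaptiveModel`, `compress`, `decompress`) are hypothetical, and the linear-scan reordering is a simplification rather than the authors' data structure.

```python
def etdc_encode(rank):
    """Byte codeword for a 0-based frequency rank; the final byte is >= 128."""
    out = [128 + rank % 128]
    rank //= 128
    while rank > 0:
        rank -= 1
        out.append(rank % 128)
        rank //= 128
    return bytes(reversed(out))


def etdc_decode(data, pos=0):
    """Decode one codeword starting at pos; return (rank, next position)."""
    rank = 0
    while data[pos] < 128:          # bytes below 128 mean "more bytes follow"
        rank = rank * 128 + data[pos] + 1
        pos += 1
    return rank * 128 + (data[pos] - 128), pos + 1


class AdaptiveModel:
    """Vocabulary kept in non-increasing frequency order (assumed design)."""

    def __init__(self):
        self.words, self.counts, self.pos = [], [], {}

    def bump(self, word):
        i = self.pos.get(word)
        if i is None:               # unseen word: append with count 0
            i = len(self.words)
            self.words.append(word)
            self.counts.append(0)
            self.pos[word] = i
        self.counts[i] += 1
        j = i                       # find the first slot of the old-count block
        while j > 0 and self.counts[j - 1] < self.counts[i]:
            j -= 1
        wi, wj = self.words[i], self.words[j]
        self.words[i], self.words[j] = wj, wi
        self.counts[i], self.counts[j] = self.counts[j], self.counts[i]
        self.pos[wi], self.pos[wj] = j, i


def compress(words):
    model, out = AdaptiveModel(), bytearray()
    for w in words:
        rank = model.pos.get(w)
        if rank is None:            # escape codeword, then the word itself
            out += etdc_encode(len(model.words))
            out += w.encode("utf-8") + b"\x00"
        else:
            out += etdc_encode(rank)
        model.bump(w)               # sender updates its model after each word
    return bytes(out)


def decompress(data):
    model, words, pos = AdaptiveModel(), [], 0
    while pos < len(data):
        rank, pos = etdc_decode(data, pos)
        if rank == len(model.words):        # escape: read the new word
            end = data.index(b"\x00", pos)
            w = data[pos:end].decode("utf-8")
            pos = end + 1
        else:
            w = model.words[rank]
        words.append(w)
        model.bump(w)               # receiver mirrors the sender's update
    return words


text = "to be or not to be that is the question".split()
assert decompress(compress(text)) == text
```

Because both sides apply the same update after every word, the receiver rebuilds the vocabulary and its ordering without ever being sent a model, which is the setting the abstract describes.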

Similar resources

Boosting Text Compression with Word-Based Statistical Encoding

Semistatic word-based byte-oriented compressors are known to be attractive alternatives to compress natural language texts. With compression ratios around 30-35%, they allow fast direct searching of compressed text. In this article we reveal that these compressors have even more benefits. We show that most of the state-of-the-art compressors benefit from compressing not the original text, but t...

Natural Language Compression on Edge-Guided text preprocessing

This paper presents Edge-Guided (E-G), an optimized text preprocessing technique for compression purposes. It transforms the original text into a word net, which stores all relationships between adjoining words. A specific directed graph is proposed to model this transformation: words are stored in vertices, whereas edges represent word transitions. Thus, the word net has a text representation ...

Adaptive Compression Techniques and Efficient Query Evaluation for XML Databases-An overview

Extensible Markup Language (XML) is proposed as a standardized data format designed for specifying and exchanging data on the Web. With the proliferation of mobile devices, such as palmtop computers, as a means of communication in recent years, it is reasonable to expect that in the foreseeable future, a massive amount of XML data will be generated and exchanged between applications in order to...

Natural scene text localization using edge color signature

Localizing text regions in images taken from natural scenes is one of the challenging problems due to variations in font, size, color and orientation of text. In this paper, we introduce a new concept so-called Edge Color Signature for localizing text regions in an image. This method is able to localize both Farsi and English texts. In the proposed method first a pyramid using diff...

An Adaptive Algorithm for Text Detection from Natural Scenes

We present a new adaptive algorithm for automatic detection of text from a natural scene. The initial cues of text regions are first detected from the captured image/video. An adaptive color modeling and searching algorithm is then utilized near the initial text cues, to discriminate text/non-text regions. EM optimization algorithm is used for color modeling, under the constraint of text layout...

Journal title:
  • Softw., Pract. Exper.

Volume 38  Issue 

Pages  -

Publication date 2008